1 Overview

The ZIFCO project studies many facets of respiratory infections in Germany. Participants of the German National Cohort (Nationale Kohorte, NAKO) were asked a series of questions, such as whether they had certain symptoms, via the app PIA. In parallel, blood and swab samples were taken and analyzed by different labs.

Here we process raw data sets generated by ZIFCO and produce two series of cleaned data sets. The first is meant for further analyses in a format most convenient for processing in scientific programming languages such as R or Python. The second series of data sets, derived from the first, fulfills all criteria for an upload into NAKO’s central database and is accompanied by a meta-data table, as required.

The data processing and export is performed with a suite of custom R scripts. Where applicable, the processing steps have been programmed in a general and flexible manner. Other steps are specific to particular data sets, sometimes to individual data entries.

2 Data sets

2.1 Description

In this report data sets are referred to by the names used during processing. These are defined in the configuration file; see section One central configuration file for a preview of the configuration and the link between data-set names and data-set files. For example, the name nasal_swabs_pcr refers to the data set contained in the file lab_results.csv.

Two data sets are exports from the PIA questionnaires: answers is the direct export of 15 December 2022, answers_backup a database export. The latter is used because some questionnaires had been inadvertently deleted and thus must be added back to the normal export.

pia_codebook is the code book of PIA: it contains the list of questionnaires, questions, follow-up questions, and possible answers, as well as various meta data on these variables.

consent lists whether participants gave different consents, as well as whether entries are in fact test entries and should be removed during processing.

cpt_hub and cpt_pia refer to blood samples as recorded by HUB and PIA, respectively. They are merged into a single data set cpt during processing.

examination contains examination dates for a data freeze of 26 July 2022.

nasal_swabs_pcr contains the participant pseudonym, sample ID’s, and PCR results of tests against a variety of respiratory viruses.

pbmc, plasma and swabs are the records of the corresponding samples at HUB. pbmc is obtained by merging the two raw data sets pbmc_1 and pbmc_2 during processing.

samples is a lookup table for matching sample ID’s and participant pseudonyms.

During processing, three further tables are generated for matching samples and participants, ids_lookup, ids_lookup_1, ids_lookup_2, see below for details.

Note that currently only nasal_swabs_pcr contains results of sample analyses.

2.2 Previews before and after processing

The tabs below show previews of all raw and processed data sets. The data formatted for NAKO and the corresponding meta-data are also shown. For each data set (besides the code book and NAKO meta data) a random sample of 1000 entries has been chosen; different entries are shown before and after processing.

2.2.1 Raw

Note that dates and date-times are encoded as numbers in Excel and are only displayed as proper dates within Excel. For example, July 5, 2011 is encoded as 40729. We import data as close to raw as possible, so the dates in Excel files are imported as strings of numbers.
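For example, such a serial number can be decoded in R as follows (a sketch; the origin 1899-12-30 is the standard offset for Windows Excel, absorbing Excel’s spurious leap day 1900-02-29):

```r
# Excel for Windows counts days from an origin of 1899-12-30:
excel_serial <- as.numeric("40729")           # dates arrive as strings
as.Date(excel_serial, origin = "1899-12-30")  # "2011-07-05"
```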

answers

[no preview of individual- or sample-based data in non-confidential report]

answers_backup

[no preview of individual- or sample-based data in non-confidential report]

cpt_hub

[no preview of individual- or sample-based data in non-confidential report]

cpt_pia

[no preview of individual- or sample-based data in non-confidential report]

examination

[no preview of individual- or sample-based data in non-confidential report]

nasal_swabs_pcr

[no preview of individual- or sample-based data in non-confidential report]

pbmc_1

[no preview of individual- or sample-based data in non-confidential report]

pbmc_2

[no preview of individual- or sample-based data in non-confidential report]

plasma

[no preview of individual- or sample-based data in non-confidential report]

samples

[no preview of individual- or sample-based data in non-confidential report]

swabs

[no preview of individual- or sample-based data in non-confidential report]

pia_codebook

2.2.2 Processed

answers

[no preview of individual- or sample-based data in non-confidential report]

pia_codebook

examination

[no preview of individual- or sample-based data in non-confidential report]

nasal_swabs_pcr

[no preview of individual- or sample-based data in non-confidential report]

plasma

[no preview of individual- or sample-based data in non-confidential report]

samples

[no preview of individual- or sample-based data in non-confidential report]

swabs

[no preview of individual- or sample-based data in non-confidential report]

ids_lookup_1

[no preview of individual- or sample-based data in non-confidential report]

ids_lookup_2

[no preview of individual- or sample-based data in non-confidential report]

ids_lookup

[no preview of individual- or sample-based data in non-confidential report]

pbmc

[no preview of individual- or sample-based data in non-confidential report]

cpt

[no preview of individual- or sample-based data in non-confidential report]

2.2.3 Data for NAKO

Note that cells with missing values are shown as empty here but contain null in the exported CSV files.

nasal_swabs_pcr

[no preview of individual- or sample-based data in non-confidential report]

swabs

[no preview of individual- or sample-based data in non-confidential report]

plasma

[no preview of individual- or sample-based data in non-confidential report]

cpt

[no preview of individual- or sample-based data in non-confidential report]

pbmc

[no preview of individual- or sample-based data in non-confidential report]

2.2.3.1 _Beobachtung_Atemwege {-}

[no preview of individual- or sample-based data in non-confidential report]

2_Beobachtungsfragebogen

[no preview of individual- or sample-based data in non-confidential report]

2.2.3.2 2__Beobachtungsfragebogen_AGI {-}

[no preview of individual- or sample-based data in non-confidential report]

2.2.3.3 2__Beobachtungsfragebogen_ARI {-}

[no preview of individual- or sample-based data in non-confidential report]

2.2.3.4 2__Beobachtungsfragebogen_HWI {-}

[no preview of individual- or sample-based data in non-confidential report]

Beobachtung_Harnwege

[no preview of individual- or sample-based data in non-confidential report]

Beobachtung_Magen_Darm_Trakt

[no preview of individual- or sample-based data in non-confidential report]

Beobachtungsfragebogen_

[no preview of individual- or sample-based data in non-confidential report]

Demografie

[no preview of individual- or sample-based data in non-confidential report]

Gesundheitszustand

[no preview of individual- or sample-based data in non-confidential report]

Nahrungsmittelunvertr_glichkeiten

[no preview of individual- or sample-based data in non-confidential report]

Regionsfragebogen

[no preview of individual- or sample-based data in non-confidential report]

Spontanmeldung

[no preview of individual- or sample-based data in non-confidential report]

2.2.3.5 Spontanmeldung__2__Beobachtungsfragebogen_ {-}

[no preview of individual- or sample-based data in non-confidential report]

2.2.3.6 Spontanmeldung__Beobachtung_Atemwege {-}

[no preview of individual- or sample-based data in non-confidential report]

2.2.3.7 Spontanmeldung__Beobachtung_Harnwege {-}

[no preview of individual- or sample-based data in non-confidential report]

2.2.3.8 Spontanmeldung__Beobachtung_Magen_Darm_Trakt {-}

[no preview of individual- or sample-based data in non-confidential report]

Spontanmeldung_Beobachtungsfragebogen

[no preview of individual- or sample-based data in non-confidential report]

2.2.3.9 Spontanmeldung__Symptome_Atemwege {-}

[no preview of individual- or sample-based data in non-confidential report]

2.2.3.10 Spontanmeldung__Symptome_Harnwege {-}

[no preview of individual- or sample-based data in non-confidential report]

2.2.3.11 Spontanmeldung__Symptome_Magen_Darm_Trakt {-}

[no preview of individual- or sample-based data in non-confidential report]

2.2.3.12 Spontanmeldung__Symptomfragebogen {-}

[no preview of individual- or sample-based data in non-confidential report]

Symptome_Atemwege

[no preview of individual- or sample-based data in non-confidential report]

Symptome_Harnwege

[no preview of individual- or sample-based data in non-confidential report]

Symptome_Magen_Darm_Trakt

[no preview of individual- or sample-based data in non-confidential report]

Symptomfragebogen

[no preview of individual- or sample-based data in non-confidential report]

Technikbereitschaft

[no preview of individual- or sample-based data in non-confidential report]

2.2.3.13 Tierkontakte__Kontakte_zu_Kindern_und_Reisen {-}

[no preview of individual- or sample-based data in non-confidential report]

2.2.4 Variable names for NAKO

2.2.5 Meta-data for NAKO

3 Configuration

3.1 One central configuration file

Many of the options which users might want to change are listed in the configuration file config.yml. This is a text file that can be opened and edited with any text-processing software. (Clicking on the link opens it in the browser but doesn’t allow one to edit it.) On Windows, Notepad++ offers convenient functions, but for example Wordpad can be used as well.

The extension .yml indicates it is structured as a YAML file. It is similar to JSON, but more flexible. It allows the rapid definition and editing of structured, nested data. The main rules are the following:

  • the file should start and end with three minus signs ---
  • each element is designated by its name, without spaces, followed by a colon
  • sub-elements, i.e., elements that belong to the first, follow the same rule but appear below the first with an indentation of two or more spaces
  • this can be done recursively any number of times
  • the value of the element appears after the colon
  • a set of values can be indicated between square brackets and separated with commas
  • character strings can be written surrounded by quotation marks but don’t have to be
  • comments can be written anywhere and have to be preceded by #

All directory paths should be written with a slash / separating directories instead of the Windows standard \.
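Applying these rules, a fragment of such a configuration might look as follows (illustrative values only, not the project’s actual config.yml):

```yaml
---
# comments are preceded by a hash
datasets:
  all_data_sets:
    raw_dir: "data/raw"        # quotation marks around strings are optional
    save_native: true
  nasal_swabs_pcr:             # a data-set name, nested under datasets
    raw_data_file: lab_results.csv
remove_variables: [ids, comment]
---
```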

Please see the accompanying file config.yml. The resulting information can be visualized as a nested list. Below is a visualization of the configuration used to process the data (it doesn’t correspond to the syntax used in the file):

One notable feature is the possibility of mapping many variables across data sets to one variable which will have the same properties everywhere. For example, in the configuration above, all three variables Proband, user_id, Pseudonym PIA are mapped to participant_id which will have the same title and description in all data sets (expand as follows: object > variables > participant_id > original).

3.2 Configuration properties

The configuration contains the following elements:

  • pipeline_ouput: what to do with all the variables and data generated when running the processing pipeline
    • save: true or false, whether to save the output
    • file: string, path to the file where the output is saved
  • convert_time_from_germany: true or false, whether to keep the date-times as they are in the data set assuming the UTC time, or to assume they were measured in Germany and convert them to UTC, accounting for winter and summer times (as of 2022)
  • match_via_hub: true or false, whether to use the participant pseudonyms provided by HUB as intermediate keys to match final participant pseudonyms and sample ID’s
  • remove_variables: a set of strings, variables to remove from the data sets during processing
  • remove_entries_with_values: a list of variables, each associated with a set of values for which any corresponding entry in any data set is removed during processing
  • remove_entries_ambiguous_rna_samples: true or false, whether to remove RNA swab samples (identified by sample ID’s “zifco-11” followed by 8 digits) with which no participant can be associated unambiguously
  • nako: a list of configurations specific to the export for the NAKO
    • application_number: number or string, the NAKO application number, used to name variables as well as export folder
    • keep_datasets: set of strings, names of data sets to export
    • remove_vars: set of strings, variables to remove from the export
    • guess_scale_level: true or false, whether the program should guess the scale level (metric, ordinal, nominal, …) of a variable based on its type (integer, character, date, …)
    • replace_formatting_char_with: string, which replaces the characters used for formatting CSV in the text fields, so as to avoid breaking the format of the NAKO exports
    • missing_codings: string, how missing values are to be encoded; currently only the value "-1" is accepted
    • missing_options: string, what to print in the NAKO meta-data as option (fixed set of possible values that a variable can take); currently only the value "-1 = Fehlende Rohdaten" is accepted
    • export_path: string, path to directory where the NAKO files should be written
    • generate_samples: true or false, whether to generate samples of the tables exported for the NAKO
  • dataset_types: list of broad categories of data-sets, currently used for organizing the NAKO export
    • {data-set-type name}: list of configurations for this data-set type
      • title: string, title of the data-set type; used for NAKO export
      • description: string, description of the data-set type; used for NAKO export
  • datasets: a list of the data-sets to be imported, processed, and/or exported
    • all_data_sets: a list of configurations that apply to all data sets; each can be omitted or set to null and defined for each data-set individually instead
      • raw_dir: string, directory where to find the raw-data-set files
      • save_native: true or false, whether to save the imported raw data sets in the R-native format RDS
      • read_native: true or false, whether to import raw data sets in the R-native format RDS instead of the original files (can speed up the import significantly)
      • native_dir: string, path to the directory where the raw data in RDS format should be saved or read from
      • save_processed: true or false, whether to save the processed data sets (automatically in the R-native format RDS)
      • processed_dir: string, path to directory where the processed data sets are saved
    • {data-set name}: list of configurations specific to each data set
      • raw_data_file: string, path to original file or folder containing the raw data-set; is appended to raw_dir defined above
      • raw_data_native_format_file: string, path to native (RDS) file containing the raw data-set; is appended to native_dir defined above
      • dataset_type: string, the data-set type this data set belongs to; this should be the same name {data-set-type name} as appears in the list above
      • title: string, title of the data-set; used for NAKO export
      • description: string, description of the data set; used for NAKO export
  • variables: a list of variables for which specific configurations apply; other variables present in the data sets but omitted here will be processed in a default manner
    • {variable name}: list of configurations specific to this variable; all are optional (can be omitted or set to null) besides original
      • original: set of strings, variable names, across data-sets, to be mapped to this one variable
      • type: string, either boolean, character, date, date_time, integer, or float; if omitted or null, the variable is treated as character
      • unit: string, the unit of the quantity stored in the variable; used for NAKO export
      • scale_level: string, “scale level” (Skalenniveau) of the variable; can be either “metrisch”, “nominal”, “ordinal”, “hierarchisch” or “Text”, used for NAKO export
      • dataset: string, the name, as listed above, of the data set to which the original variable belongs; this is relevant when two variables in two data sets have the same name but should be mapped to different variables

4 Data-processing pipeline

The data-processing pipeline consists of six main steps:

  1. import of raw data
  2. preparation of PIA code book
  3. standardization of variable names
  4. cleaning of variable values
  5. data-set operations: filtering, merging and other transformations
  6. transformation and export for the NAKO

Furthermore, various checks and statistics are computed to be used in this report.

All this is executed from the R script pipeline.R, which itself calls all necessary functions, defined in the various scripts in the R folder.

4.1 Import of raw data

Depending on the configuration, the raw data sets are imported either from the original files or from files in the R-native RDS format. The latter are obtained by importing from the original file once in R and saving from R in the RDS format.

The original files can be either RDS, JSON, Excel, or CSV files. In the first two cases variable types are conserved; in the latter two cases, values are imported as characters (at this stage it’s more stable and less error-prone not to try and automatically guess the types). If raw_data_file in the configuration file points towards a folder, then all files within it are imported in one list (each file is imported as one element of the list).
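A minimal sketch of such an import dispatch (a hypothetical helper, not the project’s actual code; the JSON and Excel branches assume the jsonlite and readxl packages):

```r
# Import one raw file; Excel and CSV values are kept as character
# so that type conversion can happen in a controlled later step.
import_raw <- function(path) {
  switch(tolower(tools::file_ext(path)),
    rds  = readRDS(path),
    json = jsonlite::read_json(path, simplifyVector = TRUE),
    xlsx = ,
    xls  = readxl::read_excel(path, col_types = "text"),
    csv  = read.csv(path, colClasses = "character"),
    stop("unsupported file type: ", path)
  )
}
```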

4.2 Preparation of PIA code book

The code book is converted from a nested list to a table with only relevant variables kept. The variable types are set explicitly. The questionnaire version is read from the corresponding file name: if the latter ends with “(x)”, where “x” is a number, then the version is x+1, otherwise the version is 1. For example, “2. Beobachtungsfragebogen AGI.json” is version 1 of that questionnaire, while “2. Beobachtungsfragebogen AGI(4).json” is version 5.
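This rule can be sketched in R (a hypothetical helper, not the project’s actual function):

```r
# Read the questionnaire version from the file name: "(x)" before
# ".json" means version x+1, no such suffix means version 1.
questionnaire_version <- function(file_name) {
  m <- regmatches(file_name, regexpr("\\(([0-9]+)\\)\\.json$", file_name))
  if (length(m) == 0) return(1L)
  as.integer(gsub("[^0-9]", "", m)) + 1L
}
questionnaire_version("2. Beobachtungsfragebogen AGI.json")     # 1
questionnaire_version("2. Beobachtungsfragebogen AGI(4).json")  # 5
```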

A variable “question_single” which identifies unique questions is built to match the corresponding variable in answers. It is the concatenation of questionnaire_name, “_v”, questionnaire_version, “*f”, question_position, “*”, answer_position.
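For example (with hypothetical values for the components):

```r
# Build the question identifier from its components (illustrative values):
question_single <- paste0("Symptomfragebogen", "_v", 2, "*f", 3, "*", 1)
question_single  # "Symptomfragebogen_v2*f3*1"
```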

The variable is_decimal is discarded because it is not obvious to interpret: it seems to be either missing or FALSE, the latter always for enumerable variables usually starting with “How many days …”. The variable answer_type_id is kept, but what it means is unclear as well: does it give the type of the answer (integer, character, etc.)? It is never missing and can be any integer between 1 and 8.

Lastly, two variables look at the order of the possible answers:

  • answer_level counts the number of the answer (is it the first, the second, …) when iterating through the code book
  • answer_position is directly read from the field position of the code book

Both should always be the same but are kept for consistency checks. Are they currently identical? True.

4.3 Standardization of variable names

Variable names are replaced according to the config file: each time a variable listed under the original property appears (possibly, in selected data sets), it is replaced by the name of the higher-level element. In our previous example, each time a variable is called Proband it is replaced by participant_id.

When variables don’t appear in the configuration, a default standardization is applied (the function make_clean_names of the R package janitor): names are made unique and consist only of the _ character, numbers, and letters.
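For example (assuming the janitor package):

```r
library(janitor)
make_clean_names("Pseudonym PIA")          # "pseudonym_pia"
make_clean_names(c("Proband", "Proband"))  # "proband" "proband_2"
```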

At the end, it is verified that no two variables that weren’t listed under the same name end up with the same name.

4.4 Cleaning of variable values

The cleaning step itself consists of two steps:

  • formatting of values according to data-set and variable specific rules
  • type formatting and conversion

4.4.1 Specific formatting

The values of the following variables are modified (see section Standardization of variable names for original names and data sets):

  • initial_quantity have comma replaced with dot, “ g” removed, value “99,00 g” replaced with missing;
  • remaining_quantity have comma replaced with dot and “ µl” removed;
  • concentration have comma replaced with dot and “ xE” replaced with “e” (which is then interpreted as 10^6);
  • consent_blood_sample_collection, consent_result_communication, consent_sample_collection, test_participant have “Ja” or “ja”, respectively “Nein” or “nein”, replaced with TRUE respectively FALSE; other values are replaced with missing;
  • collection_date, delivery_date, analysis_date, reporting_date, questionnaire_date, answer_date have the suffix “:00” added if, when their digits are removed, they are equal to “.., :”; this allows for proper date-time conversion later;
  • answer_is_mandatory have “t” respectively “n” replaced with TRUE respectively FALSE; other values are replaced with missing;
  • participant_id, participant_id_hub, sample_id, bakt_sample_id are tested and replaced with missing if they don’t fit the predefined formats (see details in section Format of ID variables)
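The first rule, for instance, can be sketched as follows (a hypothetical helper, not the project’s actual code):

```r
# Clean initial_quantity: drop the unit, treat the sentinel "99,00 g"
# as missing, and convert the decimal comma before parsing as numeric.
clean_initial_quantity <- function(x) {
  x <- sub(" g$", "", x)
  x[x == "99,00"] <- NA
  as.numeric(sub(",", ".", x))
}
clean_initial_quantity(c("12,50 g", "99,00 g"))  # 12.5 NA
```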

4.4.2 Type formatting and conversion

Each variable is then converted to the type indicated in the config file. If no type is indicated, no conversion is applied and the variable stays as it is (character when importing from Excel or CSV).

The following types are directly converted in R: “boolean” (called logical in R), “character”, “integer”, “float” (double precision in R).

The two other possible types, “date” and “date_time”, require more processing. We allow for up to half of the entries to be ill-formatted. The values are thus assumed to be either:

  • for more than half of them, integers or strings containing only digits, which is the internal numeric representation of dates in R and other languages;
  • if the original raw-data file was Excel (identified by extension .xls or .xlsx), for more than half of them, numeric or strings containing only digits and possibly one dot; we assume Excel for Windows, Excel 2016 for Mac, or Excel for Mac 2011 was used (earlier versions of Excel for Mac used a different reference date);
  • or mostly well formatted strings; then we try to guess using the function parse_date_time of the R package lubridate.

Date encoding and decoding in Excel are discussed on Microsoft’s website and Stack Overflow.

If the format is “date”, the variable is converted to the date format, i.e., it only contains information on year, month and day. Otherwise, if the option to convert from German time is set to true in the config file, then the time zone is assumed to be Germany with daylight saving times as of 2022: we first read the values with the time zone set to UTC, and then subtract one or two hours depending on daylight saving.
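The German-time conversion can be sketched with the lubridate package (the date-time shown is illustrative):

```r
library(lubridate)
# Read the value as if it were UTC, then reinterpret it as German local
# time and convert back to true UTC:
x <- ymd_hms("2022-07-05 12:30:00", tz = "UTC")
with_tz(force_tz(x, "Europe/Berlin"), "UTC")
# "2022-07-05 10:30:00 UTC" (CEST, summer time, is UTC+2)
```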

4.5 Data-set operations: filtering, merging and other transformations

The following transformations are then applied to the data sets.

The variables sample_id and bakt_sample_id are separated, the latter renamed sample_id, and appended to the data set. This is relevant for the data set samples, see the previews of it before and after processing in the section Previews before and after processing. Note that although the direct link between sample_id and bakt_sample_id is broken, no information is actually lost, as they are still both linked through participant_id. See section RNA tests on nasal swabs for details and difficulties.

The entry of cpt_hub with HUB participant ID participant_id_hub equal to “hzif0386” and no delivery date is removed, as according to a colleague it was duplicated by error.

Test participants as identified in the data set consent are removed from all data sets.

Variables (currently ids and comment) and individual entries (currently a set of individual order numbers (Auftragsnr), see section Nasal swabs) flagged in the config file are removed from all data sets.

Samples considered irrelevant are removed from all data sets. This can be the case when:

  • one sample is linked to two different participants while having status “nicht genommen” (not taken) for one of the two; apparently the same sample ID could be re-used for a participant following another one from whom, contrary to plan, no sample had been taken;
  • if, after having been set to lower case, a sample ID starts with either “rsist-” or “rfee-”; these correspond to other studies and shouldn’t be part of the ZIFCO data sets.

answers_backup is converted to the same format as answers (e.g., one question ID containing questionnaire name and question position instead of having them separated in two columns, and the same column names).

Only valid answers are retained from multiple answers in the filled PIA questionnaires answers and answers_backup; see section Multiple answers for details.

Only the answers corresponding to the questionnaires “Spontanmeldung”, “Regionsfragebogen”, “Symptome Atemwege”, “Spontanmeldung: Symptome Atemwege” are kept in answers_backup, as these are apparently the only ones missing from the normal export answers (this hasn’t been checked). They are then added to answers.

The variables answer_values and answer_values_code are removed from answers as they are already in the code book. All answers for which we can’t find the corresponding question and answer options in the code book are removed.

Lookup tables matching participant pseudonyms and sample ID’s are generated, and one is defined as the one to be used. See section Matching participant and sample ID’s for details.

cpt_pia and cpt_hub are merged via sample and participant ID’s to produce the data set cpt.

pbmc_1 and pbmc_2 are merged via sample and participant ID’s to produce the data set pbmc.
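A toy sketch of such a merge (the join keys follow the description above; treating it as a full join, all = TRUE, is an assumption, and the columns are illustrative):

```r
# Merge PIA and HUB records of the CPT blood samples on the shared keys.
cpt_pia <- data.frame(sample_id       = "zifco-1249102811",
                      participant_id  = "l3pia506545651",
                      collection_date = as.Date("2022-07-05"))
cpt_hub <- data.frame(sample_id          = "zifco-1249102811",
                      participant_id     = "l3pia506545651",
                      participant_id_hub = "hzif0680")
cpt <- merge(cpt_pia, cpt_hub,
             by = c("sample_id", "participant_id"),
             all = TRUE)  # keep entries present in only one source
```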

A column containing the participant pseudonym is added to all data sets where it is not already present by matching the samples via the lookup table. In case participant pseudonyms are present in the data set but missing for some entries, if a non-missing pseudonym is found after matching with a sample, the non-missing value replaces the missing one.

4.6 Export for NAKO

4.6.1 Overview

The processed data sets are further transformed to meet the NAKO’s requirements, meta-data are generated, and all are exported as CSV’s in the required formats. See the previews in the sections Data for NAKO, Variable names for NAKO and Meta-data for NAKO.

One specific further data-set operation is the merging of answers with pia_codebook: the information on each question and its answers is added to answers itself because the NAKO wants data sets that all have one participant per row. (An alternative would be to print question text, answer options, etc., in the NAKO meta data, but this would be very cumbersome to use and possibly not feasible, as the text in Excel cells might exceed a maximum allowed size.)

Moreover, since the data is exported as CSV, one needs to be careful not to use as data characters that are used for CSV formatting. Thus, given NAKO’s requirements, the string for end-of-line “\n” and for new cell “;” are replaced by a string as defined in the config file.
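For instance (the replacement string comes from replace_formatting_char_with in the config; a space is used here purely for illustration):

```r
# Strip end-of-line and cell-separator characters from free-text fields
# so they cannot break the semicolon-separated CSV export.
sanitize_csv_text <- function(x, replacement = " ") {
  gsub("[\n;]", replacement, x)
}
sanitize_csv_text("fever;\ncough")  # "fever  cough"
```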

After that, the answers are split in individual questionnaires.

The requirements on how to contribute data to NAKO are described (in German) in the document “TFS-Info-12a”, see the accompanying file, also available online.

Briefly:

Meta-data, describing individual variables, have to be provided in one of two specific formats (Excel template or CSV). They consist of one table with titles and descriptions of data sets and variables; names, units, “scale level” (whether ordinal, nominal, …), and “options” (for variables that have a fixed set of possible values, what those are).

Data are provided as CSV with a few specific requirements.

Variable names have to start with “u” followed by the NAKO-application number and “_”, and overall have 20 or fewer characters. Moreover, missing values are not directly written in the column for a given variable, but rather as a separate variable in a separate column. It has the same name as the original variable with the suffix “_m” added. Thus, if all variable names have to be shorter than 20 characters, then those of the original variables should actually be 18 or fewer characters long.

To ensure that two different variables have different names after being shortened, a suffix with “_” and a number is added to the first 16 characters. (This works if there are 9 or fewer different variables with identical names after shortening.)
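A sketch of this shortening rule (a hypothetical helper; the prefix with the application number is omitted here):

```r
# Truncate names to 18 characters; names that clash after truncation
# are instead cut to 16 characters with "_" and a counter appended.
shorten_names <- function(names) {
  short <- substr(names, 1, 18)
  dup <- duplicated(short) | duplicated(short, fromLast = TRUE)
  short[dup] <- paste0(substr(names[dup], 1, 16), "_",
                       ave(seq_along(names[dup]),
                           substr(names[dup], 1, 16), FUN = seq_along))
  short
}
shorten_names(c("questionnaire_version_a", "questionnaire_version_b", "id"))
# "questionnaire_ve_1" "questionnaire_ve_2" "id"
```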

In the column containing information or results (i.e., those without the suffix “_m”) the value null is used to indicate where a value is missing. Empty fields, e.g., because there is no follow-up question, are left empty.

In the columns containing information on the missing values (i.e., the columns with the suffix “_m”), the latter should be encoded as a negative integer with the meaning of each possible value described in the column “Option”. The document suggests an encoding with four values, “-1”, “-2”, “-3”, “-4”; at the moment everything is “-1” = “Fehlende Rohdaten”. This seems to correspond to the vast majority of the missing values appearing in the exported data sets.

4.6.2 Samples

If the option is set in the config, samples of the tables for the NAKO are generated as CSV in the same format as the files for the NAKO. Parameters such as number and size of samples are set in the script sample-nako-export.R.

4.6.3 Questionnaire and variable names

Two further tables are exported: the list of questionnaire names as well as a dictionary of variable names (the correspondence of names as defined in the config or standardized on the one hand, and as used for NAKO on the other).

4.6.4 Warnings concerning reading the CSVs in Excel

Excel can open CSVs but two display problems have been encountered:

  • special characters, in particular accented German characters, are “glitchy”, e.g., “K√§ltegef√ºhl” instead of “Kältegefühl”; this is probably because of the imposed UTF-8 encoding, which is standard almost everywhere and required by the NAKO; the characters are displayed properly elsewhere, including in Word with UTF-8 encoding
  • Excel tries to interpret “-1 = Fehlende Rohdaten” in the meta-data as a formula and adds an “=” sign at the beginning of the corresponding cell…!

You can use Excel to view the CSVs, but do not save them from Excel!

5 Descriptive statistics

5.1 Variable names

The following table shows how often original variable names appear within and across the raw data sets:

5.2 Comments

For safety the different comment columns are currently removed during processing. They might contain information on the quality of samples, and currently don’t seem to contain sensitive information, but they are difficult to use in a systematic way. Closer inspection is needed.

Below are entries with a comment variable and non-missing values:

cpt_pia

[no preview of individual- or sample-based data in non-confidential report]

nasal_swabs_pcr

[no preview of individual- or sample-based data in non-confidential report]

pbmc_2

[no preview of individual- or sample-based data in non-confidential report]

samples

[no preview of individual- or sample-based data in non-confidential report]

5.3 Format of ID variables

After visual inspection and discussions:

The PIA/NAKO participant ID’s participant_id have one format. After converting to lower case: “l3pia[9 digits]”

The HUB participant ID’s participant_id_hub have two formats. After converting to lower case: “hzif[3 digits]” and “hzif[4 digits]”

The sample ID’s sample_id have different formats for different analyses. After converting to lower case:

  • nasal swabs with UTM: “zifco-10[8 digits]”
  • nasal swabs with RNA: “zifco-11[8 digits]”
  • CPT: “zifco-12[8 digits]”
  • PBMC: “na[10 digits]”
  • plasma: “[9 digits]” and “[10 digits]”

Moreover, two further sets of samples were found that matched other studies and were removed from all data sets. After converting to lower case, their ID’s start with “rsist-” and with “rfee-”.

Example of ID’s: participant_id: L3pia506545651, l3pia986165890, l3pia334237262, l3pia069361605, l3pia459107441, l3pia797835460; participant_id_hub: HZIF1056, HZIF0680, HZIF0603, HZIF533, HZIF1317, HZIF0739; sample_id: 340528686, zifco-1249102811, ZIFCO-1076497429, ZIFCO-1009011348, 222115221, ZIFCO-1036456062.

The following values were ill-formatted and replaced with missing values: participant_id: l3pia0987654321; participant_id_hub: ZIFCO_Proband_01, ZIFCO_Proband_02; sample_id: 12081497051, A, A2, B, B2, C, C2.

The corresponding data entries are:

answers

[no preview of individual- or sample-based data in non-confidential report]

cpt_hub

[no preview of individual- or sample-based data in non-confidential report]

pbmc_1

[no preview of individual- or sample-based data in non-confidential report]

pbmc_2

[no preview of individual- or sample-based data in non-confidential report]

plasma

[no preview of individual- or sample-based data in non-confidential report]

swabs

[no preview of individual- or sample-based data in non-confidential report]
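The format checks above can be expressed as regular expressions. A base-R sketch; the patterns are transcribed from the formats listed in this section, and the function names are illustrative, not the pipeline’s:

```r
# Validate the ID formats described above (after converting to lower case).
is_valid_participant_id <- function(x) {
  grepl("^l3pia[0-9]{9}$", tolower(x))       # "l3pia[9 digits]"
}
is_valid_participant_id_hub <- function(x) {
  grepl("^hzif[0-9]{3,4}$", tolower(x))      # "hzif[3 or 4 digits]"
}
is_valid_sample_id <- function(x) {
  x <- tolower(x)
  grepl("^zifco-1[012][0-9]{8}$", x) |       # UTM, RNA, and CPT samples
    grepl("^na[0-9]{10}$", x) |              # PBMC
    grepl("^[0-9]{9,10}$", x)                # plasma
}
# Ill-formatted values are replaced with missing values:
clean_ids <- function(x, is_valid) replace(x, !is_valid(x), NA)
```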

5.4 Missing sample ID’s

After processing, the percentages of entries with a missing sample ID in each data set are: answers: N.A., pia_codebook: N.A., examination: N.A., nasal_swabs_pcr: 0%, plasma: 0%, samples: 0%, consent: N.A., swabs: 0.1%, ids_lookup_1: 0%, ids_lookup_2: 0.01%, ids_lookup: 0%, pbmc: 0%, cpt: 0.26%. Here are the corresponding entries:

swabs

[no preview of individual- or sample-based data in non-confidential report]

ids_lookup_2

[no preview of individual- or sample-based data in non-confidential report]

cpt

[no preview of individual- or sample-based data in non-confidential report]
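The percentages of missing sample ID’s above can be computed per data set. A minimal base-R sketch (pct_missing_sample_id is an illustrative helper name):

```r
# Percentage of entries with a missing sample ID, rounded to 2 decimals,
# for a data set with a sample_id column.
pct_missing_sample_id <- function(df) {
  round(100 * mean(is.na(df$sample_id)), 2)
}
```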

5.5 Matching participant and sample ID’s

5.5.1 Overview

participant_id is the PIA ID, which was provided by NAKO and is thus the primary participant ID. participant_id_hub was given by labs upon sample analysis. Thus we don’t necessarily expect a one-to-one matching of participant_id and participant_id_hub. Samples should always have exactly one associated participant_id and at least one participant_id_hub.

The matching was already done in the data set nasal_swabs_pcr. The cpt data set was built from cpt_hub, which had sample ID’s and information on samples, and cpt_pia, which had the same sample ID’s plus participant ID’s: thus here as well the matching could be done directly.

There is no data set containing direct pairs of participant ID’s and the sample ID’s found in PBMC and plasma. The only way to match those to participants is via the HUB participant ID’s, see below.

5.5.2 Sample ID’s used twice

Sometimes a sample ID intended for one participant would be re-used for a second participant if, contrary to what had been foreseen, no sample was taken from the first participant. Thus entries with a missing or “nicht genommen” sample status that correspond to many different participants in a data set are removed. This was the case in the data set samples.
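This removal can be sketched in base R as follows. The column names sample_id and participant_id follow the text; sample_status is an assumed name for the original status column:

```r
# Drop entries whose sample ID maps to several participants and whose
# sample status is missing or "nicht genommen" (illustrative sketch).
drop_reused <- function(df) {
  # number of distinct participants per sample ID
  n_part <- tapply(df$participant_id, df$sample_id,
                   function(p) length(unique(p)))
  reused <- n_part[df$sample_id] > 1 &
    (is.na(df$sample_status) | df$sample_status == "nicht genommen")
  df[!reused, ]
}
```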

5.5.3 Overlap swabs data sets

There are two data sets related to swabs: swabs and nasal_swabs_pcr. The first contains only information on samples together with HUB participant ID’s; the second doesn’t have HUB participant ID’s but has analysis results.

Overlap of sample ID’s:

  • percentage of swabs samples in nasal_swabs_pcr: 29.01%
  • percentage of nasal_swabs_pcr samples in swabs: 65.14%
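These percentages can be computed directly from the sample-ID columns. A base-R sketch (overlap_pct is an illustrative helper name):

```r
# Percentage of unique IDs in a that also appear in b, rounded to 2 decimals.
overlap_pct <- function(a, b) {
  round(100 * mean(unique(a) %in% unique(b)), 2)
}
# Usage, assuming swabs and nasal_swabs_pcr data frames with sample_id:
# overlap_pct(swabs$sample_id, nasal_swabs_pcr$sample_id)
# overlap_pct(nasal_swabs_pcr$sample_id, swabs$sample_id)
```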

5.5.4 Two methods to create a participant-sample matching

Two approaches are applied to match participants and samples across data sets.

The first method consists in first directly recording the participant_id - sample_id pairs present across data sets; then adding those found by joining participant_id - participant_id_hub and participant_id_hub - sample_id pairs (this actually doesn’t come up, as there is no data set with participant_id - participant_id_hub pairs); and lastly those found over two joins: participant_id - sample_id with sample_id - participant_id_hub, and then with sample_id - participant_id. (Even though all HUB-participant pairs come together with a sample ID, it could be that two samples are linked to the same HUB participant but only the first is linked to a participant; in that case we can link the second sample over the associated HUB participant, then the first sample, then its participant.) Pairs not found directly are considered to have been found via the HUB participant ID. This produces the lookup table ids_lookup_1.
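The last step, linking a sample to a participant via another sample of the same HUB participant, can be sketched with base R merge(). The data frames ps (observed participant-sample pairs) and sh (sample-HUB pairs) are illustrative names, not the pipeline’s variables:

```r
# Link a sample to a participant over two joins:
# sample -> HUB participant -> other sample of that HUB -> its participant.
match_via_hub <- function(ps, sh) {
  # all pairs of samples sharing a HUB participant ID
  linked <- merge(sh, sh, by = "participant_id_hub",
                  suffixes = c("", "_linked"))
  # attach the participant known for the linked sample
  found <- merge(linked, ps,
                 by.x = "sample_id_linked", by.y = "sample_id")
  unique(found[, c("participant_id", "sample_id")])
}
```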

The second method consists in building all unique triplets participant_id - participant_id_hub - sample_id present in the data sets. This produces the lookup table ids_lookup_2.

5.5.4.1 Using participant ID’s from HUB

We find that the ID’s given by HUB participant_id_hub don’t uniquely correspond to participants across data sets. Thus they can’t be used to improve matching of sample ID’s sample_id and participant ID’s participant_id.

Looking at sample-participant pairs obtained via HUB participant, we find the samples were already present, and the new participants are either different or missing:

[no preview of individual- or sample-based data in non-confidential report]

These are the same sample-participant pairs as those obtained from the sample-participant-HUB-participant triplets that are not in the direct sample-participant matching observed in the data sets:

[no preview of individual- or sample-based data in non-confidential report]

5.5.4.2 Conclusion

In conclusion, without further information, it is safer not to use the HUB participant ID to link (final) participant pseudonym and sample ID. This means leaving the value match_via_hub set to false in the config file.

The final lookup table ids_lookup used to match participant and sample is the first one ids_lookup_1 without the participant-sample pairs found only via HUB participant ID’s.

5.5.5 Results

After data set operations (filtering, merging):

Out of 13759 sample ID’s with at least one match in participant ID or HUB participant ID, 0.04% (6) have more than one corresponding participant ID and 0% (0) have no corresponding participant pseudonym.

Percentages of samples without matching participant in each data set: answers: N.A., pia_codebook: N.A., examination: N.A., nasal_swabs_pcr: 0%, plasma: 0%, samples: 0%, consent: N.A., swabs: 0%, ids_lookup_1: 0%, ids_lookup_2: 0%, ids_lookup: 0%, pbmc: 0%, cpt: 0%.

When a sample has multiple matching participant ID’s across data sets, a missing participant ID is attributed in the data sets without participant ID’s. A special rule is applied to RNA tests on swabs, see below.

When a sample doesn’t have a matching participant ID, it is not included in export for NAKO.

5.6 Multiplicity of samples

5.6.1 Overview

For each data set, percentage of sample ID’s that appear more than once (expected is once): answers: N.A., pia_codebook: N.A., examination: N.A., nasal_swabs_pcr: 100%, plasma: 0%, samples: 0%, consent: N.A., swabs: 0%, ids_lookup_1: 0.02%, ids_lookup_2: 0%, ids_lookup: 0.02%, pbmc: 0%, cpt: 0.04%. Some of those are due to missing sample ID’s, which then can appear many times, see above.

5.6.2 Nasal swabs

We expect samples to appear once, except for nasal_swabs_pcr since samples are tested for up to 20 targets.

[no preview of individual- or sample-based data in non-confidential report]

These are at the same time the entries with missing sample ID’s (and most other variables missing as well) and with physician values different from the rest. They were identified by their order_number (originally: Auftragsnr) and removed from the data set. Below is the corresponding raw data.

However we still don’t find a consistent number of results for each sample in nasal_swabs_pcr. Here is the distribution of multiplicities of sample ID’s: 1 sample(s) with multiplicity 0, 1 sample(s) with multiplicity 3, 70 sample(s) with multiplicity 10, 21 sample(s) with multiplicity 11, 2 sample(s) with multiplicity 12, 1 sample(s) with multiplicity 14, 266 sample(s) with multiplicity 15, 21 sample(s) with multiplicity 16, 54 sample(s) with multiplicity 17. Indeed it can happen that not all samples are tested for all targets.
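A multiplicity distribution like the one above can be obtained with a nested table(). A minimal base-R sketch:

```r
# How many sample IDs appear exactly k times: the inner table() counts
# occurrences per ID, the outer table() counts IDs per occurrence count.
multiplicity_dist <- function(sample_id) {
  table(table(sample_id))
}
```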

5.6.3 RNA tests on nasal swabs

Swabs can be analyzed via UTM (ID’s of the form zifco-10[8 digits]) or RNA (zifco-11[8 digits]). There are ambiguities regarding the latter in the data set samples: the same RNA sample ID, originally called Bakt_Proben_ID, sometimes appears many times with different UTM ID’s (originally: Proben_ID) and participant ID’s (Proband). Before any processing, this happens for 6.19% of the entries of the raw samples data set. The variables Status and Bemerkungen might hold clues on how to interpret this, but no obvious ones. Below are the corresponding data:

[no preview of individual- or sample-based data in non-confidential report]

For lack of further information, and given that at the time of writing no results are available for RNA tests, the samples with an ambiguous matching of RNA ID and participant ID are removed from all data sets (samples, swabs). This is done after separating UTM and RNA ID’s in samples, so no information is lost for the former and the link between the two is maintained via the participant ID.

5.7 Duplicated entries

After processing, number of duplicated entries (if an entry appears twice, it’s counted once) in each data set: no data set has duplicates.

5.8 PIA answers and code book

5.8.1 Comparison of questionnaires

Both data sets answers and answers_backup contain information on individual questions and possible answers. This can be compared to the code book to check that code book and PIA export are indeed compatible.

To compare the questionnaire information contained in pia_codebook, in answers, and in answers_backup, we can join them (after cleaning, before data set operations such as filtering or merging) using the question identifier question_single and keeping only the relevant variables. We add three variables to investigate the agreement:

  • answer_values_cb_answers_agree: whether answer_values is the same in the PIA code book and in the PIA answers
  • answer_values_code_cb_answers_agree: whether answer_values_code is the same in the PIA code book and in the PIA answers
  • answer_values_code_cb_answers_backup_agree: whether answer_values_code is the same in the PIA code book and in the backup.

Since answer_values from answers_backup is formatted differently and would be cumbersome to parse, we don’t systematically compare it to the other data sets. This is the result:

All questionnaires (including their version) with a mismatch are: “2. Beobachtungsfragebogen AGI_v1”, “2. Beobachtungsfragebogen AGI_v2”, “2. Beobachtungsfragebogen Vaginose_v1”, “Beobachtung Vaginose_v1”, “Corona-Impfungen_v1”, “Fragebogen Impfungen I_v1”, “Fragebogen Impfungen II_v1”, “Impfung Corona _v1”, “Impfung Corona _v3”, “Impfung Corona _v4”, “Lebensqualität_v1”, “Nutzerakzeptanz_v1”, “Spontanmeldung_v1”, “Spontanmeldung_v2”, “Spontanmeldung: Beobachtungsfragebogen Vaginose_v1”, “Symptome Vaginose_v1”.

There is only one true mismatch, where a question of the questionnaire “Spontanmeldung” seems to correspond to one questionnaire version in the code book and another in the PIA answers, presumably follow-up questions were added or removed. Pending further investigation, the corresponding questions (the second question and its follow-ups in the versions 1 and 2 of the “Spontanmeldung” questionnaire, i.e., identified as “Spontanmeldung_v1_f2_1”, “Spontanmeldung_v1_f2_2”, “Spontanmeldung_v2_f2_1”, “Spontanmeldung_v2_f2_2”, “Spontanmeldung_v2_f2_3”, “Spontanmeldung_v2_f2_4”) are removed.

All other mismatches are explained by the absence of the corresponding questionnaire from the code book. Those questionnaires are sometimes found in the backup and sometimes aren’t. (“Nutzerakzeptanz” is there but only in version 2.)

The solution chosen consists in removing all answers to questions not found in the code book: they are useless if one doesn’t know the corresponding question.
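The agreement flags described above can be sketched with a join on the question identifier. This is an illustrative base-R sketch for one of the three variables; the pipeline compares more columns across the three data sets:

```r
# Join code book and answers on question_single and flag whether
# answer_values_code agrees between the two data sets.
compare_codebook <- function(pia_codebook, answers) {
  check <- merge(pia_codebook, answers, by = "question_single",
                 suffixes = c("_cb", "_ans"))
  check$answer_values_code_cb_answers_agree <-
    check$answer_values_code_cb == check$answer_values_code_ans
  check
}
```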

5.8.2 Answers (without backup)

5.8.2.1 Empty fields

To limit computation times, this section is based on a random sample of 10% of answers.

Before data set operations (filtering, merging), of the 120991648 values stored in answers, 36.74% are missing.

In particular there are many entries with the same question_id and participant_id, with questionnaire_date values many months apart, and with answer_values_code and answer_values both equal to "[]". Overall 58.41% of entries have empty codes and values. However some of them have answer dates (0% of entries with empty codes and values) and, of those, some have answers (0% of entries with empty codes and values).

Overall 58.41% of entries have at the same time an empty answer, answer date, answer codes, and answer values. Overall 98.15% of the data have a missing answer and 95.79% a missing answer date.

No processing is done in that regard. In the future, it might make sense to split the data into one data set useful only for quality assessment and one useful both for quality assessment and scientific questions. Furthermore there surely is a more efficient way to store the data; the code book could maybe help here.

5.8.2.2 Multiple answers

To limit computation times, this section is based on a random sample of 10% of answers.

Before data set operations (filtering, merging):

When present, the fourth and last number in question_id, of the form "_a*", indicates which time the question has been raised or answered by the same participant for the same questionnaire (participants have a certain amount of time to change an answer once given). The answer with the highest number is then the latest and the one to keep.

However, in 4.14% of the cases we don’t find the expected combinations of either no "_a*", or "_a1" and "_a2" once each for each questionnaire date. (This is after removing test participants.) Most often this is one question appearing only once with either "_a1" or "_a2", but a few other cases appear as well (0.33% overall). Presumably the vast majority of those have one or two missing answer dates.

Processing:

Remove "_a*" from the question name, storing the * as a number (set to 0 if there is no "_a*" suffix), and keep the entry with the highest number. Stop here pending further discussions. Check that, when defined, questions with _a2 always have later answers than the same question with _a1, and similarly for _a* versus no suffix.

One could go further and keep only the most recent answer for a given participant, question, and questionnaire date-time, where an entry with an answer date has precedence over an entry with a missing answer date. But this should be discussed and investigated more deeply first.
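The suffix handling can be sketched as follows; split_suffix is an illustrative helper, not the pipeline function:

```r
# Strip the "_a*" suffix from a question ID, storing * as a number
# (0 when no suffix is present); the entry with the highest number per
# participant, question, and date is then the one to keep.
split_suffix <- function(question_id) {
  has_suffix <- grepl("_a[0-9]+$", question_id)
  n <- ifelse(has_suffix, sub("^.*_a([0-9]+)$", "\\1", question_id), "0")
  data.frame(question = sub("_a[0-9]+$", "", question_id),
             answer_round = as.integer(n),
             stringsAsFactors = FALSE)
}
```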

6 Technical considerations

6.1 Environment

Here is the session information on the computer that ran the processing pipeline:

## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Ventura 13.3.1
## 
## Matrix products: default
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] DT_0.26      dplyr_1.1.0  tidyr_1.2.1  tibble_3.1.8
## 
## loaded via a namespace (and not attached):
##  [1] rstudioapi_0.14   janitor_2.1.0     magrittr_2.0.3    tidyselect_1.2.0  timechange_0.1.1  R6_2.5.1          rlang_1.0.6       fastmap_1.1.0     fansi_1.0.4       stringr_1.5.0     tools_4.2.2       utf8_1.2.3        cli_3.6.0         withr_2.5.0       ellipsis_0.3.2    htmltools_0.5.3   yaml_2.3.6        digest_0.6.30     lifecycle_1.0.3   purrr_1.0.1       htmlwidgets_1.5.4 vctrs_0.5.2       snakecase_0.11.0  glue_1.6.2        stringi_1.7.8     compiler_4.2.2    pillar_1.8.1      generics_0.1.3    lubridate_1.9.0   pkgconfig_2.0.3

6.2 Processing times

(All computation times on a MacBook Pro 13-inch, 2020, CPU: 2.3 GHz Quad-Core Intel Core i7, Memory: 32 GB 3733 MHz LPDDR4X. The orders of magnitude should not change much for comparable hardware.)

The whole processing and NAKO export takes 1 hour 44 minutes.

Exporting the complete output of the pipeline (i.e., saving the workspace) takes a few more minutes.

Duration individual steps:

  • loading of raw data (see config about whether this was from initial files or from native RDS): 27 seconds
  • read and process PIA code book: 6 seconds
  • get and standardize variable names: 3 seconds
  • standardize data: 1 second
  • clean data: 1 hour 6 minutes
  • transform datasets: 19 minutes
  • prepare for NAKO: 10 minutes
  • save processed datasets: 42 seconds
  • export for NAKO: 3 minutes
  • generate samples for NAKO: 2 minutes
  • check variable names and matching: 8 seconds
  • check PIA answers: 2 minutes

Generating this report took about: 3 minutes.

6.3 Memory

Memory usage is not optimized: similar, large objects are stored without being overwritten. A quick look suggests that 16 GB of memory (RAM) could be necessary and should be sufficient, but this hasn’t been tested. (The pipeline was run on a computer with 32 GB memory.)

Surely an effective way to save memory would be to overwrite the data set list at each processing step, e.g., replace in pipeline.R the different variables dataset_list_* with a single variable dataset_list.
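The suggested change amounts to threading a single variable through the processing steps. A minimal sketch with placeholder step functions:

```r
# Apply processing steps in order, overwriting a single dataset list
# instead of keeping dataset_list_raw, dataset_list_clean, and so on.
run_pipeline <- function(dataset_list, steps) {
  for (step in steps) dataset_list <- step(dataset_list)
  dataset_list
}
```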

7 Acknowledgments

Parts of the code used for data processing were inspired by and/or checked against code previously written by Irina Jansen. Requirements and many details were contributed and explained by Stefanie Castell, Jana Heise, and Irina Jansen.